Abstract: We consider decentralized restless multi-armed bandit
problems with unknown dynamics and multiple players. The
reward state of each arm transits according to an unknown 
Markovian rule when it is played and evolves according to an 
arbitrary unknown random process when it is passive. Players 
activating the same arm at the same time collide and suffer 
from reward loss. The objective is to maximize the long-term 
reward by designing a decentralized arm selection policy to 
address unknown reward models and collisions among players. 
A decentralized policy is constructed that achieves a regret 
with logarithmic order when an arbitrary nontrivial bound 
on certain system parameters is known. When no knowledge 
about the system is available, we extend the policy to achieve a 
regret arbitrarily close to the logarithmic order. The result finds 
applications in communication networks, financial investment, 
and industrial engineering. 

I. Introduction 
A. The Classic MAB with A Single Player 

In the classic MAB, there are N independent arms and a 
single player. Each arm, when played, offers an i.i.d. random
reward to the player. The reward distribution of each arm
is unknown. At each time, the player chooses one arm to 
play, aiming to maximize the total expected reward in the 
long run. This problem involves the well-known tradeoff 
between exploitation and exploration. For exploitation, the 
player should select the arm with the largest sample mean 
of reward. For exploration, the player should select an under- 
played arm to learn its reward statistics. 

Under the non-Bayesian formulation, the performance measure
of an arm selection policy is the so-called regret, or
the cost of learning, defined as the reward loss with respect
to the case with known reward models [1]. In 1985, Lai
and Robbins showed that the minimum regret grows at a
logarithmic order under certain regularity conditions [1]. The
best leading constant was also obtained, and an optimal policy
was constructed to achieve the minimum regret growth rate
(both the logarithmic order and the best leading constant).
In 1987, Anantharam et al. extended Lai and Robbins's
results to accommodate multiple simultaneous plays [2] and
a Markovian reward model where the reward of each arm
evolves as an unknown Markov process over successive plays
and remains frozen when the arm is passive (the so-called
rested Markovian reward model) [3].

Several other simpler policies have been developed to
achieve logarithmic regret for the classic MAB under an i.i.d.
reward model [4], [5]. In particular, the index policy proposed
in [5], referred to as Upper Confidence Bound 1 (UCB-1),
achieves the logarithmic regret with a uniform bound on the
leading constant over time. In [6], UCB-1 was extended to the
rested Markovian reward model adopted in [3].

B. Decentralized MAB with Distributed Multiple Players 

In [7], Liu and Zhao formulated and studied a decentralized
version of the classic MAB with M (M < N) distributed 
players under the i.i.d. reward model. Different arms can have 
different reward distributions and they are unknown to the 
players. At each time, a player chooses one arm to play based 
on its local observation and decision history without exchang- 
ing information with other players. Collisions occur when 
multiple players choose the same arm, and, depending on the 
collision model, either no one receives reward or the colliding 
players share the reward in an arbitrary way. The objective 
is to maximize the long-term sum reward from all players. 
Another desired feature of policies for decentralized MAB is 
fairness, i.e., different players have the same expected reward 
growth rate. Liu and Zhao proposed the Time Division Fair
Sharing (TDFS) framework, which achieves the same logarithmic
regret order as the centralized case where all players share their
observations in learning and collisions are eliminated through
centralized perfect scheduling [7]. Assuming a Bernoulli reward
model, decentralized MAB was also addressed in [8],
where the single-player policy UCB-1 was extended to the
multi-player setting.

C. Main Results 

In this paper, we consider the decentralized MAB with a 
restless Markovian reward model. In a single-player restless 
MAB, the reward state of each arm transits according to an 
unknown Markovian rule when played and transits according 
to an arbitrary unknown random process when passive as 
addressed in our prior work [9]. In [9], we proposed a policy,
Restless UCB (RUCB), which achieves a logarithmic order of
the weak regret, defined as the reward loss compared to the
case when the player knows which arm is the most rewarding
and always plays the best arm. RUCB borrows the index form
of UCB-1 given in [5] and has a deterministic epoch structure
with carefully chosen epoch lengths to balance exploration and
exploitation. The concept of weak regret was first used in [10];
it measures the reward loss with respect to the optimal single-arm
policy, which, while optimal under the i.i.d. and rested
Markovian reward models (up to an O(1) term of loss for the
latter), is no longer optimal in general under a known restless
reward model. Analysis of the strict regret of restless MAB is
in general intractable given that finding the optimal policy of
a restless bandit under a known model is itself PSPACE-hard in
general [11].

In this paper, we extend RUCB proposed in our prior
work [9] to a decentralized setting of restless MAB with
multiple players. We consider two types of restless reward 
models: exogenous restless model and endogenous restless 
model. In the former, the system itself is rested: the state of an 
arm does not change when the arm is not engaged. However, 
from each individual player's perspective, arms are restless 
due to actions of other players that are unobservable and 
uncontrollable. Under the endogenous restless model, the state 
of an arm evolves according to an arbitrary unknown random 
process even when the arm is not played. Under both restless 
models, we extend RUCB to achieve a logarithmic order of the 
regret. The result for the exogenous restless model, however, 
is stronger in the sense that the regret is indeed defined with 
respect to the optimal policy under known reward models. This 
is possible due to the inherent rested nature of the systems. 

There are a couple of parallel works to [9] on the single-player
restless MAB. In [12], Tekin and Liu adopted the weak
regret and proposed a policy that achieves logarithmic (weak)
regret when certain knowledge about the system parameters is
available [12]. The policy proposed in [12] also uses the index
form of UCB-1 given in [5], but the structure is different from
RUCB proposed in [9]. Specifically, under the policy proposed
in [12], an arm is played consecutively for a random number
of times determined by the regenerative cycle of a particular
state, and observations obtained outside the regenerative cycle
are not used in learning. RUCB, however, has a deterministic
epoch structure, and all observations are used in learning.
In [13], the strict regret was considered for a special class
of restless MAB. Specifically, when arms are governed by
stochastically identical two-state Markov chains, a policy was
constructed in [13] to achieve a regret with an order arbitrarily
close to logarithmic.

Notation: For two positive integers k and l, define
$k \oslash l = ((k - 1) \bmod l) + 1$, which is an integer taking values
from 1, 2, ..., l.
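The wrap-around operator defined above can be illustrated with a minimal Python sketch. The function name `wrap_mod` and the example values (M = 3 players in subepoch m = 1) are illustrative assumptions, not part of the paper; the offset expression m - k + M + 1 is the one used later in the policy description.

```python
def wrap_mod(k: int, l: int) -> int:
    """Return ((k - 1) mod l) + 1, an integer in {1, ..., l}.

    Hypothetical helper name; the definition follows the Notation paragraph.
    """
    if k < 1 or l < 1:
        raise ValueError("k and l must be positive integers")
    return (k - 1) % l + 1


# Illustration (assumed values): with M = 3 players in subepoch m = 1,
# the offsets m - k + M + 1 map the players to distinct arm ranks.
M, m = 3, 1
ranks = [wrap_mod(m - k + M + 1, M) for k in range(1, M + 1)]
print(ranks)
```

Because the operator returns values in 1, ..., l rather than 0, ..., l - 1, it can index arm ranks directly, and distinct players receive distinct ranks within a subepoch.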

II. Problem Formulation 

In the decentralized MAB problem, we have M players 
and N independent arms. At each time, each player chooses 
one arm to play. Each arm, when played (activated), offers
a certain amount of reward that models the current state of the
arm. Let $s_j(t)$ and $S_j$ denote the state of arm j at time t and
the state space of arm j, respectively. Different arms can have
different state spaces. When arm j is played, its state changes
according to a Markovian rule with $P_j$ as the transition
matrix. The transition matrices are assumed to be irreducible,
aperiodic, and reversible. As for the state transition of passive
arms, we consider two models: the endogenous restless model and
the exogenous restless model. In the endogenous restless model,
arm states change in arbitrary ways when not played. In the
exogenous restless model, arm states remain frozen if not
engaged. The players do not know the transition matrices of
the arms and do not communicate with each other. Conflicts
occur when different players select the same arm to play.
Under different conflict models, either the players in conflict
share the reward or no one obtains any reward. The objective
is to maximize the expected total reward collected in the long
run. Let $\pi^j = \{\pi^j_s\}_{s \in S_j}$ denote the stationary distribution of
arm j (under $P_j$), where $\pi^j_s$ is the stationary probability (under
$P_j$) that arm j is in state s. The stationary mean reward $\mu_j$
is given by $\mu_j = \sum_{s \in S_j} s\,\pi^j_s$. Let $\sigma$ be a permutation of
{1, ..., N} such that

$$\mu_{\sigma(1)} \ge \mu_{\sigma(2)} \ge \mu_{\sigma(3)} \ge \cdots \ge \mu_{\sigma(N)}.$$

A policy $\Phi$ is a rule that specifies an arm to play based on the
local observation history. Let $t_j(n)$ denote the time index of
the nth play on arm j, and $T_j(t)$ the total number of plays on
arm j by time t. Notice that both $t_j(n)$ and $T_j(t)$ are random
variables with distributions determined by the policy $\Phi$. Under
the conflict model where players in conflict share the reward,
the total reward by time t is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n)). \tag{1}$$

Under the conflict model where no players in conflict obtain
any reward, the total reward by time t is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n))\, I_j(t_j(n)), \tag{2}$$

where $I_j(t_j(n)) = 1$ if arm j is played by one and only one
player at time $t_j(n)$, and $I_j(t_j(n)) = 0$ otherwise.
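The two conflict models can be compared on a small play log with the following minimal Python sketch. The data layout (a list of `(time, player, arm)` tuples and a dictionary of reward states) and the function name `total_reward` are assumptions made for illustration; the sketch also assumes that each activation time of an arm counts once in the sum, which is one reading of "the nth play on arm j".

```python
from collections import Counter


def total_reward(plays, states, share=True):
    """Total reward of a play log under the two conflict models.

    plays:  list of (time, player, arm) tuples (hypothetical layout).
    states: dict mapping (arm, time) to the reward state of that arm.
    share:  True  -> colliding players share the reward (arm counted once);
            False -> a collision yields no reward at all.
    """
    players_on = Counter((arm, t) for t, _player, arm in plays)
    total = 0.0
    for (arm, t), n_players in players_on.items():
        if share or n_players == 1:
            total += states[(arm, t)]
    return total


plays = [(1, "a", 0), (1, "b", 0), (1, "c", 1), (2, "a", 0), (2, "b", 1)]
states = {(0, 1): 2.0, (1, 1): 1.0, (0, 2): 3.0, (1, 2): 0.5}
print(total_reward(plays, states, share=True))   # collision on arm 0 at time 1 still pays
print(total_reward(plays, states, share=False))  # that collision pays nothing
```

The only difference between the two calls is the activation of arm 0 at time 1, where two players collide.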

As mentioned in Sec. I, for both restless models, the
performance of any policy $\Phi$ is evaluated using the regret $r_\Phi(t)$,
defined as the reward loss with respect to having the M best
arms constantly engaged. Specifically, for both restless models,
regret is defined as follows:

$$r_\Phi(t) = t \sum_{i=1}^{M} \mu_{\sigma(i)} - \mathbb{E}_\Phi R(t) + O(1), \tag{3}$$

where the constant O(1) is caused by the transient effects of
playing the M best arms, and $\mathbb{E}_\Phi$ denotes the expectation with
respect to the random process induced by policy $\Phi$. The
objective is to minimize the growth rate of the regret. Note
that the constant O(1) term can be ignored when studying the
growth rate of the regret.
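As a small numerical illustration of the regret definition (3), the sketch below evaluates it with the O(1) term dropped. The function name `regret` and the numbers are hypothetical; in practice the expected reward would be estimated, for instance by averaging over simulation runs.

```python
def regret(t, means, M, expected_reward):
    """Regret as in (3) without the O(1) term: t times the sum of the M
    largest stationary means, minus the expected total reward by time t.

    All argument names are illustrative assumptions.
    """
    best_m = sorted(means, reverse=True)[:M]
    return t * sum(best_m) - expected_reward


# Assumed example: three arms, two players, horizon t = 10.
print(regret(10, [0.25, 1.0, 0.5], 2, 12.0))
```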
The epoch structure (i.e., the starting and ending points
of the epochs) consists of predetermined numbers that depend only on
the parameter D. This is one of the key reasons why different players can be
coordinated (i.e., enter the same epoch at the same time)
without intercommunication.

[Figure: upper panel, the general structure of decentralized RUCB; lower panel, the structure of the nth exploitation epoch for player 1.]

Fig. 1. Epoch structures of decentralized RUCB

III. The Decentralized RUCB Policy 

The proposed decentralized RUCB is based on an epoch 
structure. We divide the time into disjoint epochs. There 
are two types of epochs: exploitation epochs and exploration 
epochs (see an illustration in Fig. 1). In the exploitation
epochs, the players calculate the indexes of all arms and play
the arms with the M highest indexes, which are believed to
be the M best arms. In the exploration epochs, the players
obtain information of all arms by playing them equally many
times. The purpose of the exploration epochs is to make
decisions in the exploitation epochs sufficiently accurate. As
shown in Fig. 1, in the nth exploration epoch, each player
plays every arm $4^{n-1}$ times. At the beginning of the nth
exploitation epoch, the player calculates the index of every arm
(see (5) in Fig. 2) and selects the arms with the M highest
indexes (denoted as arm $a^*_{(1)}$ to arm $a^*_{(M)}$). Each exploitation
epoch is divided into M subepochs, each having a length
of $2 \times 4^{n-1}$. Player k plays arm $a^*_{((m-k+M+1) \oslash M)}$ in the
mth subepoch of each exploitation epoch. The details on
interleaving the two types of epochs are given in Step 2
in Fig. 2. Specifically, whenever sufficiently many ($D \ln t$,
see (4)) observations have been obtained from every arm in
the exploration epochs, the player is ready to proceed with a
new exploitation epoch. Otherwise, another exploration epoch
is required to gain more information about each arm. It is also
implied in (4) that only logarithmically many plays are spent
in the exploration epochs, which is one of the key reasons
for the logarithmic regret of decentralized RUCB. This also
implies that the exploration epochs are much less frequent 
than the exploitation epochs. Though the exploration epochs 
can be understood as the "information gathering" phase, and 
the exploitation epochs as the "information utilization" phase, 
observations obtained in the exploitation epochs are also used 
in learning the arm dynamics. This can be seen in Step 3 in
Fig. 2.

Decentralized RUCB

Time is divided into epochs. There are two types of epochs,
exploration epochs and exploitation epochs. At the beginning of
the nth exploitation epoch, we choose the M arms to play, each of
them for $2 \times 4^{n-1}$ many times. In the nth exploration epoch, we
play every arm $4^{n-1}$ many times. Let $n_O(t)$ denote the number
of exploration epochs played by time t and $n_I(t)$ the number of
exploitation epochs played by time t.

1. At t = 1, we start the first exploration epoch, in which every
arm is played once. We set $n_O(N+1) = 1$, $n_I(N+1) = 0$.
Then go to Step 2.

2. Let $X_1(t) = (4^{n_O(t)} - 1)/3$ be the time spent on each arm in
exploration epochs by time t. Choose D according to (6)-(7).
If

$$X_1(t) > D \ln t, \tag{4}$$

go to Step 3 (start an exploitation epoch). Otherwise, go to
Step 4 (start an exploration epoch).

3. Calculate the indexes $d_{i,t}$ for all arms using the formula below:

$$d_{i,t} = \bar{s}_i(t) + \sqrt{\frac{L \ln t}{T_i(t)}}, \tag{5}$$

where t is the current time, $\bar{s}_i(t)$ is the sample mean from
arm i by time t, L is chosen according to (6), and $T_i(t)$ is
the number of times we have played arm i by time t. Then
choose the arms with the M highest indexes (arm $a^*_{(1)}$ to arm
$a^*_{(M)}$). Each exploitation epoch is divided into M subepochs,
each having a length of $2 \times 4^{n_I}$. Player k plays arm
$a^*_{((m-k+M+1) \oslash M)}$ in the mth subepoch of each exploitation
epoch. After arm $a^*_{(1)}$ to arm $a^*_{(M)}$ are played, increase $n_I$
by one and go to Step 2.

4. Play each arm for $4^{n_O}$ slots. Each exploration
epoch is divided into N subepochs, each having a length
of $4^{n_O}$. Player k plays arm $(m-k+N+1) \oslash N$ in the mth
subepoch of each exploration epoch. After all the arms are
played, increase $n_O$ by one and go to Step 2.

Fig. 2. Decentralized RUCB policy
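The interleaving rule in Steps 1-4 is deterministic given N, M and D, so every player can compute the same epoch boundaries locally. The following minimal Python sketch traces that schedule and evaluates the index of (5). It is a reading of Fig. 2 under stated assumptions, not the authors' implementation: the names `epoch_schedule` and `ucb_index` are hypothetical, the next exploration epoch is assumed to last N subepochs of $4^{n_O}$ slots, and the next exploitation epoch M subepochs of $2 \times 4^{n_I}$ slots.

```python
import math


def epoch_schedule(N, M, D, horizon):
    """List (kind, start_time, length) for all epochs starting by `horizon`."""
    epochs = [("explore", 1, N)]          # Step 1: every arm is played once
    n_o, n_i, t = 1, 0, N + 1
    while t <= horizon:
        x1 = (4 ** n_o - 1) / 3           # time spent on each arm while exploring
        if x1 > D * math.log(t):          # condition (4): enough observations
            kind, length = "exploit", M * 2 * 4 ** n_i
            n_i += 1
        else:
            kind, length = "explore", N * 4 ** n_o
            n_o += 1
        epochs.append((kind, t, length))
        t += length
    return epochs


def ucb_index(sample_mean, L, t, plays):
    """Index of (5): sample mean plus sqrt(L ln t / T_i(t))."""
    return sample_mean + math.sqrt(L * math.log(t) / plays)


schedule = epoch_schedule(N=3, M=2, D=1.0, horizon=100)
for epoch in schedule:
    print(epoch)
```

With these assumed parameters only the first two epochs are exploration epochs; afterwards the condition (4) holds for a long stretch, which reflects the logarithmic number of exploration plays.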



A. Eliminate Pre-Agreement 

So far we have assumed a pre-agreement among the players:
they target the M best arms with different offsets to avoid
excessive collisions. In this subsection, we show that this
pre-agreement can be eliminated while maintaining the logarithmic
order of the system regret. Furthermore, players can join the
system at different times without any global synchronization.
Specifically, at each player, the structure of the exploration
and exploitation epochs is the same as in the local RUCB policy
with pre-agreement. The only difference here is that in each
exploitation epoch, the player randomly chooses one of the
M arms considered as the best to play whenever a collision
with other players is observed. If no collision is observed, the
player keeps playing the same arm. This simple elimination
of pre-agreement leads to a complete decentralization among
players while achieving the same logarithmic order of the
system regret. Besides joining the system
according to its local schedule, a player can also leave the
system for an arbitrary finite time period.
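The collision-driven rule described above can be sketched in Python as follows. This is a toy illustration under simplifying assumptions that are not in the paper: all players are taken to agree on the same set of M best arms (labeled 0 to M-1), every collision is observed, and the names `resolve_step` and `settle` are hypothetical.

```python
import random


def resolve_step(choice, M, rng):
    """One slot: a player that collided re-draws uniformly among the M arms;
    a player that saw no collision keeps its arm."""
    counts = [choice.count(arm) for arm in range(M)]
    return [rng.randrange(M) if counts[c] > 1 else c for c in choice]


def settle(M, rounds, seed=0):
    """Run the rule for `rounds` slots from a random start (M players)."""
    rng = random.Random(seed)
    choice = [rng.randrange(M) for _ in range(M)]
    for _ in range(rounds):
        choice = resolve_step(choice, M, rng)
    return choice


print(settle(M=3, rounds=500))
```

A collision-free assignment is a fixed point of `resolve_step`, so once the players happen to occupy distinct arms they stay there without exchanging any information.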

IV. The Logarithmic Regret of Decentralized RUCB

In this section, we show that the regret achieved by the
decentralized RUCB policy has a logarithmic order. This is
given in the following theorem.

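Theorem 1: Under the exogenous restless Markovian reward
model, assume that when arms are engaged, they can
be modeled as finite state, irreducible, aperiodic, and
reversible Markov chains. All the states (rewards) are positive.
Let $\pi_{\min} = \min_{s \in S_i, 1 \le i \le N} \pi^i_s$, $\epsilon_{\max} = \max_{1 \le i \le N} \epsilon_i$,
$\epsilon_{\min} = \min_{1 \le i \le N} \epsilon_i$, $s_{\max} = \max_{s \in S_i, 1 \le i \le N} s$,
$s_{\min} = \min_{s \in S_i, 1 \le i \le N} s$, and $|S|_{\max} = \max_{1 \le i \le N} |S_i|$, where $\epsilon_i =
1 - \lambda_i$ ($\lambda_i$ is the second largest eigenvalue of the matrix $P_i$).
Assume that different arms have different $\mu$ values.$^1$ Set the
policy parameters L and D to satisfy the following conditions:

$$L \ge \frac{1}{\epsilon_{\min}} \left( \frac{4 \cdot 20\, s_{\max}^2 |S|_{\max}^2}{3 - 2\sqrt{2}} + 10\, s_{\max}^2 \right), \tag{6}$$

$$D \ge \frac{4L}{\left( \min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)}) \right)^2}. \tag{7}$$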



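Under the conflict model where players share the reward,
the regret of decentralized RUCB at the end of any epoch can
be upper bounded by

[Equation (8): the expression is illegible in the source. The legible fragments show an exploration term with the factor $[4(3D \ln t + 1) - 1]$, exploitation terms with the factor $3\lceil \log_4(\frac{3}{2}(t-N) + 1) \rceil$ and sums over $(|S_{\sigma(i)}| + |S_{\sigma(j)}|)$ and $(|S_{\sigma(M)}| + |S_{\sigma(j)}|)$, and the constant $\sum_{i=1}^{N} [(\min_{s \in S_i} \pi^i_s)^{-1} \sum_{s \in S_i} s]$.]

Under the model where no player in conflict gets any
reward, the regret of decentralized RUCB at the end of any
epoch can be upper bounded by:

[Equation (9): the expression is illegible in the source. The legible fragments show the factor $3\lceil \log_4(\frac{3}{2}(t-N) + 1) \rceil$ multiplying a triple sum over $i = 1, \ldots, M$, $i = 1, \ldots, M$ and $j \ne i$, a term with the factor $[4(3D \ln t + 1) - 1]$, and the constant $\sum_{i=1}^{N} [(\min_{s \in S_i} \pi^i_s)^{-1} \sum_{s \in S_i} s]$.]

$^1$ This assumption can be relaxed by utilizing the shared index set. This
assumption is only for simplicity of the presentation.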



We point out that the upper bounds on regret in Theorem 1
can be extended to any time t instead of only the ending
points of epochs. They can also be extended to the endogenous
restless model in terms of weak regret. The no pre-agreement
version of decentralized RUCB can also achieve regret with a
logarithmic order.

Proof: See Appendix A for details. ■

Theorem 1 requires an arbitrary (nontrivial) bound on $s^2_{\max}$,
$|S|_{\max}$, $\epsilon_{\min}$, and $\min_{j \le M}(\mu_{\sigma(j)} - \mu_{\sigma(j+1)})$. In the case where
these bounds are unavailable, D and L can be chosen to
increase with time to achieve a regret order arbitrarily close
to logarithmic order. This is formally stated in the following
theorem.
theorem. 

Theorem 2: Assume the exogenous restless model and that
all arms, when engaged, are modeled as finite state,
irreducible, aperiodic, and reversible Markov chains. For any
increasing sequence f(t) ($f(t) \to \infty$ as $t \to \infty$), if L(t) and
D(t) are chosen such that $L(t) \to \infty$ as $t \to \infty$, $\frac{f(t)}{D(t)} \to \infty$
as $t \to \infty$, and $\frac{D(t)}{L(t)} \to \infty$ as $t \to \infty$, then we have

$$r_\Phi(t) \sim o(f(t) \log(t)). \tag{10}$$

We point out that the conclusion in Theorem 2 still holds for
the endogenous restless model, though the proof needs to be
modified.

Proof: See Appendix B for details. ■

V. Conclusion 

In this paper, we studied the decentralized restless multi- 
armed bandit problems, where distributed players aim to 
accrue the maximum long-term reward without knowing the 
system reward statistics. Under the exogenous model where 
the arm reward status remains static when not engaged, we 
proposed a policy to achieve the optimal logarithmic order 
of the system regret. Under the endogenous model where the 
arm reward status evolves according to an arbitrary random 
process when not engaged, we showed that the proposed policy 
achieves a logarithmic (weak) regret. Furthermore, we showed 
that the proposed policy achieves a complete decentralization
where no pre-agreement or global synchronization among
players is required.

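Appendix A. Proof of Theorem 1

We first rewrite the definition of regret as

$$\begin{aligned} r_\Phi(t) &= t \sum_{i=1}^{M} \mu_{\sigma(i)} - \mathbb{E}_\Phi R(t) \\ &= \sum_{i=1}^{N} \Big[ \mu_i \mathbb{E}[T_i(t)] - \mathbb{E}\Big[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \Big] \Big] + t \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_i \mathbb{E}[T_i(t)]. \end{aligned} \tag{11}$$

To bound the first term in (11), Lemma 1 is introduced
below: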

Lemma 1 [3]: Let $Y_1, Y_2, \ldots$ be Markovian with state space
S, matrix of transition probabilities P, an initial distribution q,
and stationary distribution $\pi$ ($\pi_s$ is the stationary probability of
state s). Let $F_t$ be the $\sigma$-algebra generated by $Y_1, Y_2, \ldots, Y_t$
and G a $\sigma$-algebra independent of $F_\infty = \vee_{t \ge 1} F_t$. Let T be a
stopping time of $\{F_t \vee G\}$. The state (reward) at time t is
denoted by s(t). Let $\mu$ denote the mean reward. For any stopping
time T, there exists a value $A_P \le (\min_{s \in S} \pi_s)^{-1} \sum_{s \in S} s$
such that $\mathbb{E}[\sum_{t=1}^{T} s(t) - \mu T] \le A_P$.

Using Lemma 1, the first term in (11) can be bounded by
the following constant:

$$\sum_{i=1}^{N} \Big[ \big( \min_{s \in S_i} \pi^i_s \big)^{-1} \sum_{s \in S_i} s \Big]. \tag{12}$$

To show that the regret has a logarithmic order, it is
sufficient to show that the second term plus the third term
in (11) has a logarithmic order. These two terms can be
understood as regret caused by two reasons. The first one is
engaging bad arms in the exploration epochs. The second one
is not playing the expected arms in the exploitation epochs.
It is therefore sufficient to show that the regret caused by each
of the two reasons above has a logarithmic order.

Let $T_O(t)$ denote the time spent on each arm in the
exploration epochs by time t. An upper bound on $T_O(t)$
is:

$$T_O(t) \le \frac{1}{3} [4(3D \ln t + 1) - 1]. \tag{13}$$

Consequently, the regret caused by engaging bad arms in
the exploration epochs by time t is upper bounded by

$$\frac{1}{3} [4(3D \ln t + 1) - 1] \Big( N \sum_{j=1}^{M} \mu_{\sigma(j)} - M \sum_{i=1}^{N} \mu_{\sigma(i)} \Big). \tag{14}$$



The second reason for regret in the second term of (11) is
not playing the expected arms in the exploitation epochs. Let
$t_n$ denote the beginning point of the nth exploitation epoch.
Let $\Pr[i,j,n]$ denote the probability that arm i has a higher
index than arm j at $t_n$, where $\mu_i < \mu_j$ and $\mu_j \ge \mu_{\sigma(M)}$. It
can be shown that:

[Equation (15): an upper bound on $\Pr[i,j,n]$; the expression is illegible in the source.]

Since different subepochs in the exploitation epochs are
symmetric, the regret in different subepochs is the same. In
the first subepoch, player k aims at arm $\sigma(k)$. In the model
where players in conflict share the reward, player k failing to
identify arm $\sigma(k)$ in the first subepoch of the nth exploitation
epoch can lead to a regret no more than $\mu_{\sigma(k)} \cdot 2 \times 4^{n-1}$. In
calculating the upper bound for regret, for player M, we can
assume that playing arm $\sigma(M+1)$ to arm $\sigma(N)$ can
contribute to the total reward. Thus an upper bound for regret
in the nth exploitation epoch can be obtained as

[Equation (16): an upper bound on the regret in the nth exploitation epoch; the expression is illegible in the source. The legible fragments show the factor $4^{n-1}$, a sum over $j = M+1, \ldots, N$ of $(\mu_{\sigma(M)} - \mu_{\sigma(j)})$, and terms of the form $(|S_{\sigma(M)}| + |S_{\sigma(j)}|)$.]



■j=M+l 



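By time t, at most (t - N) time slots have been spent on
the exploitation epochs. Thus

$$n_I(t) \le \Big\lceil \log_4 \Big( \frac{3}{2}(t - N) + 1 \Big) \Big\rceil. \tag{17}$$

From the upper bound on the number of the exploitation
epochs given in (17), and also the fact that $t_n \ge \frac{2}{3} 4^{n-1}$,
we have the following upper bound on the regret caused in the
exploitation epochs by time t (denoted by $r_{\Phi,I}(t)$):

[Equation (18): an upper bound on $r_{\Phi,I}(t)$; the expression is illegible in the source. The legible fragments show three terms, each with the factor $3\lceil \log_4(\frac{3}{2}(t-N) + 1) \rceil$, including a sum over $j = M+1, \ldots, N$ of $(\mu_{\sigma(M)} - \mu_{\sigma(j)})$ and terms of the form $(|S_{\sigma(i)}| + |S_{\sigma(j)}|)$ and $(|S_{\sigma(M)}| + |S_{\sigma(j)}|)$.]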



Combining (11), (12), (14), and (18), we obtain the upper bound
on the regret:

[Equation (19): the resulting regret bound under the reward-sharing model; the expression is illegible in the source.]

Next we consider the model where no player in conflict gets
reward. In the first subepoch of the nth exploitation epoch,
each mistake by player k can cause regret more than $\mu_{\sigma(k)} \cdot 2 \times 4^{n-1}$.
Assuming each mistake can cause $\sum_{i=1}^{M} \mu_{\sigma(i)} \cdot 2 \times 4^{n-1}$
regret leads to the following upper bound for regret under this
conflict model:

[Equation (20): the resulting regret bound under the model where colliding players receive no reward; the expression is illegible in the source.]

Appendix B. Proof of Theorem 2

The choice of L(t) and D(t) implies that $D(t) \to \infty$ as
$t \to \infty$. The regret has three parts: the transient effect of
arms, the regret caused by playing bad arms in the exploration
epochs, and the regret caused by mistakes in the exploitation
epochs. It will be shown that each part of the regret is
on a lower order than f(t) log(t). The transient effect of arms
is the same as in Theorem 1. Thus it is upper bounded by a
constant independent of time t and is on a lower order than
f(t) log(t).

The regret caused by playing bad arms in the exploration
epochs is bounded by

$$\frac{1}{3} [4(3D(t) \ln t + 1) - 1] \Big( N \sum_{i=1}^{M} \mu_{\sigma(i)} - M \sum_{i=1}^{N} \mu_{\sigma(i)} \Big). \tag{21}$$

Since $\frac{f(t)}{D(t)} \to \infty$ as $t \to \infty$, the part of regret in (21) is on a
lower order than f(t) log(t).

For the regret caused by playing bad arms in the exploitation
epochs, it is shown below that the time spent on a bad arm i
can be bounded by a constant independent of t.

Since $\frac{D(t)}{L(t)} \to \infty$ as $t \to \infty$, there exists a time
$t_1$ such that $\forall t > t_1$, $D(t) \ge \frac{4L(t)}{(\min_{j \le M}(\mu_{\sigma(j)} - \mu_{\sigma(j+1)}))^2}$.
There also exists a time $t_2$ such that $\forall t > t_2$,
$L(t) \ge \frac{1}{\epsilon_{\min}} \big( \frac{4 \cdot 20\, s_{\max}^2 |S|_{\max}^2}{3 - 2\sqrt{2}} + 10\, s_{\max}^2 \big)$. The time spent on playing
bad arms before $t_3 = \max(t_1, t_2)$ is at most $t_3$, and
the caused regret is at most $(\sum_{j=1}^{M} \mu_{\sigma(j)})\, t_3$. The regret
caused by mistakes after $t_3$ is upper bounded by $6(1 + \ldots)\ldots$
(the remainder of this expression is illegible in the source). Thus the
regret caused by mistakes in the exploitation epochs is on a
lower order than f(t) log(t).

Because each part of the regret is on a lower order than
f(t) log(t), the total regret is also on a lower order than
f(t) log(t).